In the previous workbook, we had a classifier designed to pick between two specific digits, one that we called signal and the other background.
In this assignment, we will read in all of the digits and design a classifier which finds a specific digit (our signal again), while all of the other 9 digits serve as the background. Our background will naturally be 9 times bigger than our signal (unless we limit it).
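Conceptually, the signal-vs-rest setup is just a binary relabeling of the digit column. A toy sketch, using a hypothetical label array `y_digits` invented for illustration:

```python
import numpy as np

# Hypothetical array of true digit labels (0-9)
y_digits = np.array([5, 3, 5, 0, 9, 5, 1])

# One-vs-rest relabeling: the signal digit becomes 1, every other digit 0
signal_digit = 5
y_binary = (y_digits == signal_digit).astype(int)
print(y_binary)  # → [1 0 1 0 0 1 0]
```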
* Make a training and testing dataset using the sklearn function **train_test_split** as we did in the previous workbook.
* Use the sklearn estimator LinearSVC to fit the training data, and then predict results for both the test and training data.
Next, you need to get the performance of the estimator on both the training and testing data. To do this, I want you to make a function to calculate various performance metrics, and return the result. The function should look like this:
def binaryPerformance(y,y_pred,y_score):
    .... your code goes here
    return precision,recall,auc,fpr,tpr,thresholds
In this method,
* y = array of true labels
* y_pred = array of predicted labels from the **predict** method of the LinearSVC estimator
* y_score = array of scores from the **decision_function** method of the LinearSVC estimator
* precision,recall,auc are the calculated values of these metrics
* fpr, tpr, thresholds are lists containing the "false positive rate", "true positive rate", and "threshold"
Call the "binaryPerformance" method for both the training and testing results for your estimator. How do they compare? Look at precision, recall, AUC, and the ROC curve.
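These metrics are also available as ready-made sklearn functions, which are useful as a cross-check on the function you write. A minimal sketch on toy labels and decision scores (the arrays below are invented for illustration):

```python
import numpy as np
from sklearn import metrics

# Toy true labels, predicted labels, and decision-function scores
y       = np.array([1, 1, 0, 0, 1, 0])
y_pred  = np.array([1, 0, 0, 1, 1, 0])
y_score = np.array([2.1, -0.3, -1.5, 0.4, 1.2, -0.8])

precision = metrics.precision_score(y, y_pred)   # TP / (TP + FP)
recall    = metrics.recall_score(y, y_pred)      # TP / (TP + FN)
fpr, tpr, thresholds = metrics.roc_curve(y, y_score, pos_label=1)
auc = metrics.auc(fpr, tpr)
print(precision, recall, auc)
```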
import plotly.io as pio
pio.renderers.default='notebook'
from collections import defaultdict
from functools import partial
from itertools import repeat
def nested_defaultdict(default_factory, depth=1):
    result = partial(defaultdict, default_factory)
    for _ in repeat(None, depth - 1):
        result = partial(defaultdict, result)
    return result()
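For example, `nested_defaultdict(int, 2)` behaves like a two-level table of counters, which is how the confusion matrix is built later in this notebook. A self-contained sketch:

```python
from collections import defaultdict
from functools import partial
from itertools import repeat

def nested_defaultdict(default_factory, depth=1):
    # Build a defaultdict nested `depth` levels deep
    result = partial(defaultdict, default_factory)
    for _ in repeat(None, depth - 1):
        result = partial(defaultdict, result)
    return result()

# A 2-deep dict of ints: missing keys spring into existence as 0
cm = nested_defaultdict(int, 2)
cm[1][1] += 1   # e.g. count one true positive
cm[0][1] += 1   # and one false positive
print(cm[1][1], cm[0][1], cm[0][0])  # → 1 1 0 (unseen cells default to 0)
```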
The data is in /fs/ess/PAS2038/PHYSICS5680_OSU/data/ch3/.
import pandas as pd
#
# Use "short_" if you don't have much memory
short = ""
#short = "short_"
#
# Read in all of the digits
dfAll = pd.DataFrame()
for digit in range(10):
    print("Processing digit ",digit)
    fname = '/fs/ess/PAS2038/PHYSICS5680_OSU/data/ch3/digit_' + short + str(digit) + '.csv'
    df = pd.read_csv(fname,header=None)
    df['digit'] = digit
    dfAll = pd.concat([dfAll, df])
Processing digit  0
Processing digit  1
Processing digit  2
Processing digit  3
Processing digit  4
Processing digit  5
Processing digit  6
Processing digit  7
Processing digit  8
Processing digit  9
At the top of this block we define which of the 10 digits we want to use for our signal.
#
# Define our "signal" digit
digitSignal = 5
dfA = dfAll[dfAll['digit']==digitSignal].copy()
dfB = dfAll[dfAll['digit']!=digitSignal].copy()
dfA['signal'] = 1
dfB['signal'] = 0
print("Length of signal sample: ",len(dfA))
print("Length of background sample: ",len(dfB))
print(f'Shape of signal df: {dfA.shape}')
print(f'Shape of background df: {dfB.shape}')
Length of signal sample:  6313
Length of background sample:  63687
Shape of signal df: (6313, 786)
Shape of background df: (63687, 786)
dfA
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 776 | 777 | 778 | 779 | 780 | 781 | 782 | 783 | digit | signal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6308 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 1 |
| 6309 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 1 |
| 6310 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 1 |
| 6311 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 1 |
| 6312 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 1 |
6313 rows × 786 columns
You will need to shuffle or randomize the rows of the background data. We already read the digits in, but they arrived in batches of the same digit, so we need to shuffle the rows to mix the digits up. To figure out how to do this, google "sklearn shuffle pandas". Note that the "shuffle" method from sklearn creates a copy of the original dataframe.
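A minimal sketch of two common approaches, both of which return a shuffled copy rather than shuffling in place (the toy dataframe is invented for illustration):

```python
import pandas as pd
from sklearn.utils import shuffle

df = pd.DataFrame({'digit': [0, 0, 1, 1, 2, 2]})

# Option 1: sklearn.utils.shuffle returns a shuffled copy of the rows
df_shuf1 = shuffle(df, random_state=42)

# Option 2: pandas sample(frac=1) draws every row once, in random order
df_shuf2 = df.sample(frac=1, random_state=42).reset_index(drop=True)

print(df_shuf2['digit'].tolist())
```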
# Shuffle the background data dfB here
import numpy as np
print('Before Shuffle')
print(dfB.head())
np.random.shuffle(dfB.values)  # shuffles the rows of the underlying array in place
print('After Shuffle')
print(dfB.head())
Before Shuffle
   0  1  2  3  4  ...  783  digit  signal
0  0  0  0  0  0  ...    0      0       0
1  0  0  0  0  0  ...    0      0       0
2  0  0  0  0  0  ...    0      0       0
3  0  0  0  0  0  ...    0      0       0
4  0  0  0  0  0  ...    0      0       0
[5 rows x 786 columns]
After Shuffle
   0  1  2  3  4  ...  783  digit  signal
0  0  0  0  0  0  ...    0      7       0
1  0  0  0  0  0  ...    0      3       0
2  0  0  0  0  0  ...    0      6       0
3  0  0  0  0  0  ...    0      9       0
4  0  0  0  0  0  ...    0      0       0
[5 rows x 786 columns]
Come up with a method to limit the rows of the background data that you use, so that it is the same length (in rows) as the signal dataframe. You will want some easy way to turn this on and off. Run first with the background limited to the same length as the signal. Later you can come back and use all of the background data.
Call your limited background dataframe dfB_use.
rows_in_dfA = dfA.shape[0]
dfB_use = dfB.iloc[:rows_in_dfA,:]
print(f'Shape of dfA: {dfA.shape}')
print(f'Shape of dfB_use: {dfB_use.shape}')
# We can just take the first len(dfA) rows: the background has been
# shuffled, so we should have a mix of all digits in it
Shape of dfA: (6313, 786)
Shape of dfB_use: (6313, 786)
After steps 1 and 2, you will have a signal dataframe and a randomized background dataframe. You will need to combine these two into a single dataframe. We have done this before, but you can look at the pandas concat function. Use the name dfCombined for the combined dataframe.
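As a reminder of how `pd.concat` behaves, a toy sketch (the dataframes are invented for illustration):

```python
import pandas as pd

dfA = pd.DataFrame({'x': [1, 2], 'signal': [1, 1]})
dfB = pd.DataFrame({'x': [3, 4], 'signal': [0, 0]})

# Row-wise concatenation; ignore_index renumbers the combined index
dfC = pd.concat([dfA, dfB], ignore_index=True)
print(len(dfC))  # → 4
```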
# Your code goes here
dfCombined = pd.concat([dfA,dfB_use])
print("Size of signal sample ",len(dfA))
print("Size of background sample ",len(dfB_use))
print("Size of combined sample ",len(dfCombined))
print("Shape of combined sample: ",np.shape(dfCombined))
Size of signal sample  6313
Size of background sample  6313
Size of combined sample  12626
Shape of combined sample:  (12626, 786)
Next you will want to apply an estimator to this dataset. I want you to make a function that does both the test/train split, and then calls the estimator. The function should look like the following "skeleton". I show the expected inputs and the expected return values. Note we did all of this in the example workbook.
#
# Here is the skeleton of the method
from sklearn.model_selection import train_test_split
#
# The inputs are:
# dfCombined: the input dataframe
# estimator: this should be an sklearn classifier (only LinearSVC or SGDClassifier are expected to be used)
def runFitter(dfCombined,estimator):
    #
    # First do a test/train split
    train_digits,test_digits = train_test_split(dfCombined, test_size=0.2, random_state=42)
    X_train = train_digits.iloc[:,:784].to_numpy()
    y_train = train_digits['signal'].to_numpy()
    y_train_truedigit = train_digits['digit'].to_numpy()
    X_test = test_digits.iloc[:,:784].to_numpy()
    y_test = test_digits['signal'].to_numpy()
    y_test_truedigit = test_digits['digit'].to_numpy()
    # Now fit to our training set
    estimator.fit(X_train,y_train)
    # Now predict the classes and get the score for our training set
    y_train_pred = estimator.predict(X_train)
    y_train_score = estimator.decision_function(X_train)
    # Now predict the classes and get the score for our test set
    y_test_pred = estimator.predict(X_test)
    y_test_score = estimator.decision_function(X_test)
    return y_train,y_train_pred,y_train_score,y_train_truedigit,y_test,y_test_pred,y_test_score,y_test_truedigit
Now we can use the function we defined above. We have to define our estimator (which we get from sklearn) outside of the method and pass it as an argument to our function. We do it this way because later on we will want to call the method with a different estimator.
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
#estimator = LinearSVC(random_state=42) #,dual=False,max_iter=5000) # use dual=False when n_samples > n_features which is what we have
estimator = LinearSVC(random_state=42,dual=False,max_iter=10000) # use dual=False when n_samples > n_features which is what we have
y_train,y_train_pred,y_train_score,y_train_truedigit,y_test,y_test_pred,y_test_score,y_test_truedigit = runFitter(dfCombined,estimator)
#
# Determine the performance
#
# The inputs:
# y = array of true labels
# y_pred = array of predicted labels from the **predict** method of the LinearSVC estimator
# y_score = array of scores from the **decision_function** method of the LinearSVC estimator
#
# The return values:
# precision,recall,auc are the calculated values of these metrics
# fpr, tpr, thresholds are lists containing the "false positive rate", "true positive rate", and "threshold"
#
def binaryPerformance(y,y_pred,y_score,debug=False):
    # Assuming a binary classifier with 1=signal, 0=background
    confusionMatrix = nested_defaultdict(int,2)
    for i in range(len(y_pred)):
        trueClass = y[i] # this is either 0 or 1
        predClass = y_pred[i]
        confusionMatrix[trueClass][predClass] += 1
    # Our case (DigitA = signal, DigitB = background)
    data = [ [ confusionMatrix[1][1],confusionMatrix[1][0] ],
             [ confusionMatrix[0][1],confusionMatrix[0][0] ] ]
    df = pd.DataFrame(data)
    df.rename(columns={0:"Pred=DigitA", 1:"Pred=DigitB"},index={0:'True=DigitA',1:'True=DigitB'},inplace=True)
    print()
    print("Our confusion matrix as a dataframe:")
    print(df)
    TP = confusionMatrix[1][1]
    FP = confusionMatrix[0][1]
    FN = confusionMatrix[1][0]
    TN = confusionMatrix[0][0]
    if debug:
        print("TP predicted true, actually true ",TP)
        print("FP predicted true, actually false ",FP)
        print("TN predicted false, actually false ",TN)
        print("FN predicted false, actually true ",FN)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1_score = 2.0 / ( (1.0/precision) + (1.0/recall) )
    if debug:
        print("Precision = TP/(TP+FP) = fraction of predicted true actually true ",precision)
        print("Recall = TP/(TP+FN) = fraction of true class predicted to be true ",recall)
        print("F1 score = ",f1_score)
    # Get the ROC curve. We will use the sklearn function to do this
    from sklearn import metrics
    fpr, tpr, thresholds = metrics.roc_curve(y,y_score,pos_label=1)
    # Get the auc
    auc = metrics.auc(fpr,tpr)
    if debug:
        print("AUC this sample: ",auc)
    return precision,recall,auc,fpr,tpr,thresholds
Call the "binaryPerformance" method for both the training and testing results for your estimator. Compare them on the following metrics:
# Now get the performance
print('Test Set info')
precision_test,recall_test,auc_test,fpr_test, tpr_test, thresholds_test = binaryPerformance(y_test,y_test_pred,y_test_score)
print()
print('Train Set info')
precision_train,recall_train,auc_train,fpr_train, tpr_train, thresholds_train = binaryPerformance(y_train,y_train_pred,y_train_score)
print()
print("Precision training data: ",precision_train)
print("Recall training data: ",recall_train)
print("AUC training data: ",auc_train)
print()
print("Precision testing data: ",precision_test)
print("Recall testing data: ",recall_test)
print("AUC testing data: ",auc_test)
Test Set info
Our confusion matrix as a dataframe:
Pred=DigitA Pred=DigitB
True=DigitA 1216 100
True=DigitB 98 1112
Train Set info
Our confusion matrix as a dataframe:
Pred=DigitA Pred=DigitB
True=DigitA 4766 231
True=DigitB 253 4850
Precision training data: 0.9495915521020124
Recall training data: 0.9537722633580148
AUC training data: 0.9870571372806047
Precision testing data: 0.9254185692541856
Recall testing data: 0.9240121580547113
AUC testing data: 0.9644873018664121
# This "list" will store our results
results = []
#
# Loop over the test-dataset results
for fpr,tpr,thresh in zip(fpr_test, tpr_test, thresholds_test):
    #
    # Put the results in a dictionary
    this_dict = {'FPR':fpr,'TPR':tpr,'Threshold':thresh,'Type':'TEST'}
    #
    # Append this result to the big list
    results.append(this_dict)
#
# Loop over the training-dataset results
for fpr,tpr,thresh in zip(fpr_train, tpr_train, thresholds_train):
    #
    # Put the results in a dictionary
    this_dict = {'FPR':fpr,'TPR':tpr,'Threshold':thresh,'Type':'TRAIN'}
    #
    # Append this result to the big list
    results.append(this_dict)
# Now convert this to a dataframe - it's easy!
df_results = pd.DataFrame(results)
df_results = df_results.sort_values(["Type", "FPR"])
import plotly.express as px
fig = px.line(df_results,x='FPR',y='TPR',color='Type',
hover_data={'Threshold'},title='ROC Curve')
fig.show()
Run with a different classifier (and the large background statistics): the SGDClassifier. Compare your results to the LinearSVC classifier.
Compare SGDClassifier vs LinearSVC on the test set:
# Your code goes here
# run the fitter
from sklearn.linear_model import SGDClassifier
estimatorSGD = SGDClassifier(random_state=42)
#estimatorSGD = SGDClassifier(random_state=42,class_weight="balanced")
y_sgd_train,y_sgd_train_pred,y_sgd_train_score,y_sgd_train_truedigit,y_sgd_test,y_sgd_test_pred,y_sgd_test_score,y_sgd_test_truedigit = runFitter(dfCombined,estimatorSGD)
# Test the fitter
# Now get the performance
precision_sgd_test,recall_sgd_test,auc_sgd_test,fpr_sgd_test, tpr_sgd_test, thresholds_sgd_test = binaryPerformance(y_sgd_test,y_sgd_test_pred,y_sgd_test_score)
precision_sgd_train,recall_sgd_train,auc_sgd_train,fpr_sgd_train, tpr_sgd_train, thresholds_sgd_train = binaryPerformance(y_sgd_train,y_sgd_train_pred,y_sgd_train_score)
print()
print("Precision training data: ",precision_train)
print("Recall training data: ",recall_train)
print("AUC training data: ",auc_train)
print()
print("Precision testing data: ",precision_test)
print("Recall testing data: ",recall_test)
print("AUC testing data: ",auc_test)
print()
print("Precision SGD training data: ",precision_sgd_train)
print("Recall SGD training data: ",recall_sgd_train)
print("AUC SGD training data: ",auc_sgd_train)
print()
print("Precision SGD testing data: ",precision_sgd_test)
print("Recall SGD testing data: ",recall_sgd_test)
print("AUC SGD testing data: ",auc_sgd_test)
Our confusion matrix as a dataframe:
Pred=DigitA Pred=DigitB
True=DigitA 1114 202
True=DigitB 53 1157
Our confusion matrix as a dataframe:
Pred=DigitA Pred=DigitB
True=DigitA 4313 684
True=DigitB 141 4962
Precision training data: 0.9495915521020124
Recall training data: 0.9537722633580148
AUC training data: 0.9870571372806047
Precision testing data: 0.9254185692541856
Recall testing data: 0.9240121580547113
AUC testing data: 0.9644873018664121
Precision SGD training data: 0.968343062415806
Recall SGD training data: 0.8631178707224335
AUC SGD training data: 0.9792250031578814
Precision SGD testing data: 0.9545844044558698
Recall SGD testing data: 0.8465045592705167
AUC SGD testing data: 0.9622133185962972
# Loop over the test-dataset results
for fpr,tpr,thresh in zip(fpr_sgd_test, tpr_sgd_test, thresholds_sgd_test):
    #
    # Put the results in a dictionary
    this_dict = {'FPR':fpr,'TPR':tpr,'Threshold':thresh,'Type':'SGD_TEST'}
    #
    # Append this result to the big list
    results.append(this_dict)
#
# Loop over the training-dataset results
for fpr,tpr,thresh in zip(fpr_sgd_train, tpr_sgd_train, thresholds_sgd_train):
    #
    # Put the results in a dictionary
    this_dict = {'FPR':fpr,'TPR':tpr,'Threshold':thresh,'Type':'SGD_TRAIN'}
    #
    # Append this result to the big list
    results.append(this_dict)
# Now convert this to a dataframe - it's easy!
df_results = pd.DataFrame(results)
df_results = df_results.sort_values(["Type", "FPR"])
import plotly.express as px
fig = px.line(df_results,x='FPR',y='TPR',color='Type',
hover_data={'Threshold'},title='ROC Curve')
fig.show()
For a given signal digit, is the accuracy with which background is rejected dependent upon what the background digit is? For example, given that our signal digit is 4, do you expect that the accuracy with which an 8 is identified as background is the same as the accuracy with which a 1 is identified as background? Probably not! To answer this:
Limit the signal to 1/10 of its maximum, and the total background to the same number (so signal and background have equal size). Compare the performance of the estimator using AUC. Is it worse than what we obtained above? Use LinearSVC for the estimator.
count = np.zeros(10)
for i in range(10):
    count[i] = np.size(np.where(y_test_truedigit==i))
print(count)
[ 128. 137. 118. 148. 121. 1316. 116. 156. 156. 130.]
print(count[0])
print(y_test_pred[np.where(y_test_truedigit==0)])
# I search for where the true digit is zero, then feed that into the prediction
# array to see whether each was labeled as signal or not
print(np.size(np.where(y_test_pred[np.where(y_test_truedigit==0)] == 0 )))
# I then get the number of times that 0 was counted as background
128.0
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0]
117
correct = np.zeros(10)
for i in range(10):
    correct[i] = np.size(np.where(y_test_pred[np.where(y_test_truedigit==i)] == 0 ))
    if i == digitSignal: # the variable digitSignal comes from the first code cell in the notebook
        # We do need to correct the value for our signal digit: the line above counted
        # how many times the signal digit was labeled as background, so subtracting from
        # the count gives how many times the signal digit was labeled properly
        correct[i] = count[i] - correct[i]
print(correct)
[ 117. 135. 112. 122. 115. 1216. 99. 149. 137. 126.]
eff = np.zeros(10)
for i in range(10):
    eff[i] = correct[i]/count[i]
print(eff)
[0.9140625 0.98540146 0.94915254 0.82432432 0.95041322 0.92401216 0.85344828 0.95512821 0.87820513 0.96923077]
Rank the above by accuracy
for i in range(10):
    if i == digitSignal:
        print(f'The efficiency for signal digit {i}: {eff[i]}')
    else:
        print(f'The efficiency for background digit {i}: {eff[i]}')
The efficiency for background digit 0: 0.9140625
The efficiency for background digit 1: 0.9854014598540146
The efficiency for background digit 2: 0.9491525423728814
The efficiency for background digit 3: 0.8243243243243243
The efficiency for background digit 4: 0.9504132231404959
The efficiency for signal digit 5: 0.9240121580547113
The efficiency for background digit 6: 0.853448275862069
The efficiency for background digit 7: 0.9551282051282052
The efficiency for background digit 8: 0.8782051282051282
The efficiency for background digit 9: 0.9692307692307692
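To actually rank the digits by efficiency, `np.argsort` can be applied to the efficiency array. A sketch using the values printed above:

```python
import numpy as np

# Per-digit efficiencies from the cells above (signal digit = 5)
eff = np.array([0.9140625, 0.98540146, 0.94915254, 0.82432432, 0.95041322,
                0.92401216, 0.85344828, 0.95512821, 0.87820513, 0.96923077])

# argsort gives ascending order; reverse it to rank from best to worst
order = np.argsort(eff)[::-1]
for digit in order:
    print(f'digit {digit}: efficiency {eff[digit]:.4f}')
```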
cut_dfA = int(np.round(dfA.shape[0]/10))
dfA_cut = dfA.iloc[:cut_dfA,:]
dfB_cut = dfB.iloc[:cut_dfA,:]
print(f'Shape of dfA_cut: {dfA_cut.shape}')
print(f'Shape of dfB_cut: {dfB_cut.shape}')
Shape of dfA_cut: (631, 786)
Shape of dfB_cut: (631, 786)
dfCombined_cut = pd.concat([dfA_cut,dfB_cut])
# Your code goes here
estimator = LinearSVC(random_state=42,dual=False,max_iter=10000) # use dual=False when n_samples > n_features which is what we have
y_cut_train,y_cut_train_pred,y_cut_train_score,y_cut_train_truedigit,y_cut_test,y_cut_test_pred,y_cut_test_score,y_cut_test_truedigit = runFitter(dfCombined_cut,estimator)
# Now get the performance
print('Test Set info')
precision_cut_test,recall_cut_test,auc_cut_test,fpr_cut_test, tpr_cut_test, thresholds_cut_test = binaryPerformance(y_cut_test,y_cut_test_pred,y_cut_test_score)
print()
print('Train Set info')
precision_cut_train,recall_cut_train,auc_cut_train,fpr_cut_train, tpr_cut_train, thresholds_cut_train = binaryPerformance(y_cut_train,y_cut_train_pred,y_cut_train_score)
print()
print("Precision cut training data: ",precision_cut_train)
print("Recall cut training data: ",recall_cut_train)
print("AUC cut training data: ",auc_cut_train)
print()
print("Precision cut testing data: ",precision_cut_test)
print("Recall cut testing data: ",recall_cut_test)
print("AUC cut testing data: ",auc_cut_test)
print()
print("Precision training data: ",precision_train)
print("Recall training data: ",recall_train)
print("AUC training data: ",auc_train)
print()
print("Precision testing data: ",precision_test)
print("Recall testing data: ",recall_test)
print("AUC testing data: ",auc_test)
Test Set info
Our confusion matrix as a dataframe:
Pred=DigitA Pred=DigitB
True=DigitA 121 17
True=DigitB 12 103
Train Set info
Our confusion matrix as a dataframe:
Pred=DigitA Pred=DigitB
True=DigitA 493 0
True=DigitB 0 516
Precision cut training data: 1.0
Recall cut training data: 1.0
AUC cut training data: 1.0
Precision cut testing data: 0.9097744360902256
Recall cut testing data: 0.8768115942028986
AUC cut testing data: 0.9517958412098299
Precision training data: 0.9495915521020124
Recall training data: 0.9537722633580148
AUC training data: 0.9870571372806047
Precision testing data: 0.9254185692541856
Recall testing data: 0.9240121580547113
AUC testing data: 0.9644873018664121
When the data is cut, the AUC is smaller than with the larger data set. Thus, the performance of classifying our signal is worse with less training data.